Clustering large data sets described with discrete distributions and its application on TIMSS data set
Authors
Abstract
Symbolic Data Analysis is based on special descriptions of data: symbolic objects. Such descriptions preserve more detailed information about the data than standard representations with mean values. Representation with distributions is one special kind of symbolic object. In the clustering process, this representation allows variables of all types to be considered at the same time. We present two clustering methods based on data descriptions with discrete distributions: the adapted leaders method and the adapted agglomerative hierarchical clustering Ward's method. The two methods are compatible: they can be viewed as two approaches to solving the same clustering optimization problem. In the obtained clustering, a leader is assigned to each cluster. The descriptions of the leaders offer a simple interpretation of the clusters' characteristics. The leaders method enables us to efficiently solve clustering problems with a large number of units, while the agglomerative method is applied to the obtained leaders and enables us to decide on the right number of clusters on the basis of the corresponding dendrogram. Both methods were successfully applied in analyses of different data sets. In the paper, an application to the TIMSS data set is presented. The descriptions with distributions enable us to combine two data sets, the answers of teachers and the answers of their students, into one data set. The descriptions of the obtained clusters enable us to interpret the results in a more understandable way.
Author affiliations: University of Ljubljana, Faculty of Economics, Department of Statistics (corresponding author); University of Ljubljana, Faculty of Mathematics and Physics, Department of Mathematics; The Educational Research Institute, Slovenia.
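The two-stage idea in the abstract (leaders method on all units, then Ward's agglomeration on the resulting leaders) can be illustrated with a minimal sketch. This is not the authors' adapted algorithm: the abstract does not state the dissimilarity they use, so the code assumes squared Euclidean distance between probability vectors, under which the best leader of a cluster is the mean of its members. The function names `leaders_clustering` and `ward_merge_order` are hypothetical.

```python
import numpy as np


def leaders_clustering(P, k, n_iter=50, seed=0):
    """K-means-style leaders method; each row of P is a discrete distribution.

    Assumption (not from the paper): squared Euclidean dissimilarity, so the
    leader of a cluster is the mean of its member distributions.
    """
    rng = np.random.default_rng(seed)
    # Start from k distinct units chosen at random as initial leaders.
    leaders = P[rng.choice(len(P), size=k, replace=False)].copy()
    labels = np.full(len(P), -1)
    for _ in range(n_iter):
        # Distance of every unit to every leader, shape (n_units, k).
        d = ((P[:, None, :] - leaders[None, :, :]) ** 2).sum(axis=2)
        new_labels = d.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments are stable
        labels = new_labels
        for c in range(k):
            members = P[labels == c]
            if len(members):  # an empty cluster keeps its previous leader
                leaders[c] = members.mean(axis=0)
    sizes = np.bincount(labels, minlength=k)
    return labels, leaders, sizes


def ward_merge_order(leaders, sizes):
    """Agglomerate the leaders with the Ward criterion, weighted by cluster size.

    Returns a list of (id_a, id_b, cost) merges; merged clusters get new ids
    starting at len(leaders). The costs are the heights one would plot in a
    dendrogram to choose the number of clusters.
    """
    clusters = {i: (leaders[i], float(sizes[i]))
                for i in range(len(leaders)) if sizes[i] > 0}
    merges = []
    next_id = len(leaders)
    while len(clusters) > 1:
        ids = list(clusters)
        best = None
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                (ca, na), (cb, nb) = clusters[a], clusters[b]
                # Increase in within-cluster sum of squares if a and b merge.
                cost = na * nb / (na + nb) * ((ca - cb) ** 2).sum()
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        cost, a, b = best
        (ca, na), (cb, nb) = clusters.pop(a), clusters.pop(b)
        # The merged leader is the size-weighted mean of the two leaders.
        clusters[next_id] = ((na * ca + nb * cb) / (na + nb), na + nb)
        merges.append((a, b, float(cost)))
        next_id += 1
    return merges


if __name__ == "__main__":
    # Toy data: 300 units, each a distribution over 4 answer categories.
    rng = np.random.default_rng(1)
    P = rng.dirichlet(alpha=[1.0, 1.0, 1.0, 1.0], size=300)
    labels, leaders, sizes = leaders_clustering(P, k=10)
    for a, b, cost in ward_merge_order(leaders, sizes):
        print(a, b, round(cost, 4))
```

Because both stages minimize the same within-cluster sum of squares under this assumed dissimilarity, the leaders from the first stage can be passed directly to the second, which mirrors the compatibility the abstract describes.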
Journal title:
- Statistical Analysis and Data Mining
Volume 4
Publication date: 2011